Accelerating Iterative SpMV for Discrete Logarithm 

Problem using GPUs 



Abstract — In the cryptanalytic context, computing discrete 
logarithms in large cyclic groups using index-calculus-based 
methods, such as the number field sieve or the function field 
sieve, requires solving large sparse systems of linear equations 
modulo the group order. Most of the fast algorithms used to solve 
such systems — e.g., the conjugate gradient or the Lanczos and 
Wiedemann algorithms — iterate a product of the corresponding 
sparse matrix with a vector (SpMV). This central operation can 
be accelerated on GPUs using specific computing models and 
addressing patterns, which increase the arithmetic intensity while 
reducing irregular memory accesses. In this work, we investigate 
the implementation of SpMV kernels on NVIDIA GPUs, for 
several representations of the sparse matrix in memory. We 
explore the use of Residue Number System (RNS) arithmetic to 
accelerate modular operations. We target linear systems arising 
when attacking the discrete logarithm problem on groups of 
size 160 to 320 bits, which correspond to the sizes in the last 
record computations. The proposed SpMV implementation on 
a 1.7M-by-1.7M matrix containing 86M non-zeros, delivers a 
throughput of 24.3 SpMV/s on an NVIDIA GeForce GTX 580, 
which corresponds to 50 GFLOP/s. 

Index Terms — Discrete Logarithm Problem, Sparse-Matrix- 
Vector product, Modular Arithmetic, Residue Number System, 
GPUs 

I. Introduction 

The security of many cryptographic protocols used for au- 
thentication, key exchange, encryption, or signature, depends 
on the difficulty of solving the discrete logarithm problem 
(DLP) in a given cyclic group 0. For instance, we can rely 
on the hardness of the DLP in a multiplicative subgroup 
of a finite field. There are algorithms, such as Pollard-rho 
0, Baby-Step/Giant-Step that solve the problem in time 
exponential in the subgroup size. Another family of methods, 
known as Index-calculus methods propose to solve it in 
time sub-exponential in the finite field size. These algorithms 
require in their linear algebra step the resolution of large 
sparse systems of linear equations modulo the group order 
0. In cryptographic applications, the group order I is of 
size 160 to 320 bits. The number of rows and columns of 
the corresponding matrices are in the order of hundreds of 
thousands to millions, with only hundreds or fewer non-zero 
elements per row. This step is a serious limiting factor for the 
resolution. For example, it was reported in (6| that the linear 
algebra step of the Function Field Sieve (FFS) implementation 
to solve the DLP over GF(3 6x97 ) took 80.1 days on 252 CPU 
cores, which represents 54% of the total time. 

To solve such systems, ordinary Gaussian elimination is 
inefficient. While some elimination strategies aiming at keep- 
ing the matrix as sparse as possible can be used to reduce 
the input system somewhat, actual solving calls for the use 
of other techniques (Lanczos 0, Wiedemann [8 1) that take 
advantage of the sparsity of the matrix [9|. Either for Lanczos, 



Wiedemann or their block variants, the iterative sparse-matrix- 
vector product is the most time-consuming operation. For this 
reason, we investigate accelerating this operation on GPUs. 

The paper is organized as follows. Section HI] presents the 
background related to the hardware and the context. Sectionlllll 
discusses the arithmetic aspects of our implementation. We 
present several matrix formats and their corresponding imple- 
mentations in Section[lV] We compare the results in SectionM 
and present optimizations based on hardware considerations in 
Section [VI] Finally, Section [VTlI concludes the paper. 

II. Background 

A. GPUs and the CUDA programming model 

CUDA is the hardware and software architecture that en- 
ables NVIDIA GPUs to execute programs written with C, 
C++, OpenCL and other languages iflOll . 

A CUDA program instantiates a host code running on CPU 
and a kernel code running on GPU. The kernel code runs 
according to the Single Program Multiple Threads (SPMT) 
execution model across a set of parallel threads. The threads 
are executed in groups of 32, called warps. As a warp only 
has a single instruction fetch/decode unit, each instruction in 
the execution path is issued to all the threads in the warp. 
However, if one or more threads have a different execution 
path, execution divergence occurs. The different paths will 
then be serialized, negatively impacting the performance. 

The threads are further organized into thread blocks and 
grids of thread blocks: 

• A thread executes an instance of the kernel. It has a 
unique thread ID within its thread block, along with 
registers and private memory. 

• A thread block is a set of concurrent threads which can 
share data through shared memory and perform barrier 
synchronization which ensures that all threads within that 
block fully reach the same instruction before continuing. 
It has a unique block ID within its grid. 

« A grid is an array of thread blocks executing the same 
kernel. All the threads of the grid can also read inputs, 
and write results to global memory. 

At the hardware level, the blocks are distributed on an 
array of multi-core Streaming Multiprocessors (SMs). Each 
SM schedules and launches the threads in groups of warps. 
Recent GPUs (such as the NVIDIA Fermi architecture) allow 
for up to 48 active warps per SM. The ratio of active warps 
to the maximum supported is called occupancy. Maximizing 
the occupancy is important, as it helps hiding the memory 
latency. One should therefore pay attention to the usage of 
shared memory and registers in order to maximize occupancy. 
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Another important performance consideration in program- 
ming for the CUDA architecture is coalescing global memory 
accesses. To understand this requirement, global memory 
should be viewed in terms of aligned segments of 32 words 
of 32 bits each. Memory requests are serviced for one warp 
at a time. If the warp request hits exactly one segment, the 
access is fully coalesced and there will be only one memory 
transaction performed. If the warp accesses scattered locations, 
the accesses are uncoalesced and there will be as many 
transactions as the number of hit segments. Consequently, 
the kernel should use a coalescing-friendly pattern for greater 
memory efficiency. 

Despite their high arithmetic intensity and their large mem- 
ory bandwidth, GPUs provide small caches. In fact, Fermi 
GPUs provide the following levels of cache for the on-board 
DRAM (local and global memory): 

. 768-kByte L2-cache per GPU. 

• 16-kByte Ll-cache (per SM). It can be extended to 48 
kBytes, but this decreases shared memory from 48-kByte 
to 16-kByte. 

• A texture cache: an on-chip cache for the read-only 
texture memory. It can accelerate the memory accesses 
when neighboring threads read from nearby addresses. 

B. Sparse-Matrix-Vector product on GPUs 

Sparse-matrix computations pose some difficulties on 
GPUs, such as irregular memory accesses, load unbalance and 
low cache efficiency. Many works have focused on choosing 
suitable matrix formats and appropriate kernels to overcome 
the irregularity of the sparse matrix ifTTI . Ifl2l . lfT~3l . These 
works have explored implementing efficiently SpMV for nu- 
merical values. Schmidt et al. f\M proposed an optimized 
matrix format to accelerate exact SpMV over GF(2), that can 
be used in the linear algebra step of the Number Field Sieve 
(NFS) for integer factorization. Boyer et al. f 15] have adapted 
SpMV kernels over small finite fields and rings Z/toZ, where 
they used a double to represent the ring. In our context, since 
the order of the finite ring is large (hundreds of bits), specific 
computing models and addressing models should be used. 

In this work, we consider the product w <— A x v, where 
A is a A-by-M sparse matrix. A feature of the application 
context is that A contains small values (e.g. 32-bit integers). 
v and w are dense M -coordinate and A-coordinate vectors, 
respectively. We denote by tt-nz the number of nonzero ele- 
ments in A. Note that, in our case, no specific assumption is 
made about the structure of the matrix (such as the locations 
of the non-zeros). 

We compute the dot product between each row of A and 
the vector v. The basic operation is of the form X <— (X + 
XY) mod £, where where A is a non-zero of A, and X and Y 
are coordinates of the vectors w and v, respectively. To mini- 
mize the number of costly reductions modulo I, we accumulate 
computations, and postpone the final modular reduction of 
the result. When iterating many products (computations of the 
form A % v), we can further accumulate several SpMVs before 
reducing, as long as the intermediate results do not exceed the 
largest representable integer. As far as arithmetic over "LjlTL 



is concerned, we chose to use the Residue Number System, 
which appears to be more suited to the fine grained parallelism 
inherent to the SPMT computing model than the usual multi- 
precision representation of large integers. A comparison of the 
two representations will be proposed in subsection IV-DI 

III. Residue Number System and modular 

ARITHMETIC 

A. A brief recall 

The Residue Number System (RNS) is based on the Chinese 
Remainder Theorem (CRT). Let B — {p\,P2, ■ ■ ■ ,Pn) be a set 
of co-prime integers, which we call an RNS-basis. We define 
P as the product of all the p^s. The RNS uses the fact that 
an integer X within [0, P — 1] can be uniquely represented by 
the list (xi,X2, ■ ■ ■ ,x n ), where Xi is the residue of X modulo 
Pi, which we write as Xi = \X\ Pi . 

This number system is particularly interesting for arithmetic 
over large integers, since it distributes the computation over 
several small residues. In other words, the computational units 
that will work on the residues are independent and need 
no synchronization nor communication, as there is no carry 
propagation 1161 . ifPTl . 

If X and Y are given in their RNS representations Xrns = 
\X\ , . . . , x n ) and Y RNS = (yi,...,y n ), according to B, 
and such that X,Y < P, RNS addition and multiplication 
are realized by modular addition and multiplication on each 
component: 

XrnS +RNS YRNS = +Ul\pi, ■ ■ ■ , \%n + Vn\p„) 

Xrns x rns Y RNS = (\xi xyi\ Pl ,...,\x n x y n \ Pn ) 

The result (e.g., X+Y) should belong to the interval [0, P— 1] 
if we want to obtain a valid RNS representation. Otherwise, 
it will be reduced modulo P. 

B. RNS on GPU 

We represent the finite ring X/IX as the integer interval 
[0, ^ — 1]. Each element is stored in its RNS form. For 
performance considerations, we opted for 64-bit moduli. Each 
RNS residue is represented as an unsigned 64-bit integer in 
the range [0,pt — 1]. 

We implement the RNS operations in PTX (parallel thread 
execution) pseudo-assembly language for CUDA lfl8l . The 
basic RNS operation is z; <— (xi + Xyi) mod pi, where 
< Xi,yi,Zi < pi are RNS residues. To speed up the 
reduction modulo pi, the moduli are of the pseudo-Mersenne 
form 2 64 — a, with c, as small as possible. With this particular 
form, the reduction can be done in a small number of additions 
and (half) products |fl9l . 

To make our implementation more efficient, we propose to 
relax the condition on the input and output: Xi, z% G [0, 2 64 — 1]. 
Since 2 64 = Cj mod pi, we consider the high part of Xi + Xyi 
— i.e the part that exceeds 2 64 — , which we denote by t. We 
have (xi + Xyi) = ((x, + Xyi) mod 2 64 ) + t x a (mod pi). 
If the right-hand side exceeds 2 64 , as it is below than 2 x 2 64 , 
Cj is in the same way added to the low part. The comparison 
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with pi is thus replaced by the overflow detection, which is 
cheaper, and only a is stored instead of pi (cf. Algo. |2). 

For the problem sizes considered, taking a < 2 s is 
sufficient. The algorithm is valid while |t x c»| < 2 64 . This 
ensures correctness for |A| up to 2 56 . 



Algorithm 1: RNS Add 

Input : < x l < 2 64 , < y t < p % 
Input : q < 2 8 , such that p, = 2 64 - c» 
Output: Zi = Xi + yi (mod pi), < Zi < 2 64 

1 z t <- \xi + yi\ 2 Bi II Xi + yi < 2 64 ■ 

2 if carry then 

3 Zi i Zi -\- Ci 



Algorithm 2: RNS AddMul 



Input : < x, < 2 64 , < y t < p z , < A < 2 56 
Input : Q < 2 8 , such that pi = 2 64 - a 
Output: Zi = Xi + Axj/, (mod pi), < Zi < 2 64 

1 Zi (xj + A x yi) L //the low 64-bit part 

2 t <— (xj + A x yi) H //the high 64-bit part 

3 Z% <- \Zi + Ci X t| 2 64 //Cj x t < 2 64 

4 if carry then 



Zj + Ci 



C. RNS modular reduction 

In the chosen RNS representation, (P — 1) is the largest 
representable integer. So in the case of iterative SpMVs, we 
can accumulate at most log(j^j)/ log(r) matrix-vector prod- 
ucts before having to reduce modulo I, where r corresponds 
to the largest row norm (defined as the sum of the absolute 
values of its elements) in the matrix. To reduce the vector dst 
modulo I, there are two options: 

• Convert it to the conventional form (reconstruction), then 
reduce it and convert the reduced vector back to its RNS 
form (evaluation). 

> Perform the modular reduction in RNS. 

For the first option, the steps of reconstruction, reduction 
and evaluation are time-consuming (much longer than the 
SpMV). To further increase the number of possible products 
before reduction, we can increase P, but this would make the 
RNS operations more expensive. 

The second option allows us to remain in the RNS form 
throughout the solver computation. However, the RNS repre- 
sentation has the drawback that modular reduction is difficult 
to perform, compared to addition and multiplication. There are 
in the literature algorithms for RNS modular multiplication 
ED, iBTl . that we adapt here to derive a method to perform 
modular reduction. 

We start from the CRT reconstruction: 

n 

X = J2 JiPi mod P, 

i=l 

where we have defined Pi = P/pi and 7j = XjPj - . 



Let us also define the integer a as follows 
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p 



E 



pi 



(i) 



X can be written as 7^ — aP and, since 7^ < Pi, we 

i=l 

have that < a < 1. 

Since pi rj 2 fc , in El . Kawamura et al. approximate the 
quotient ji/pi using only the s most significant bits of 7i/2 fc . 
Hence, an estimate for a is proposed as 



?4: 



E 



>k — s 



A 



(2) 



where s is an integer parameter in [l,k] and A an error 
correcting term in ]0, 1[. 



The following result is derived from 



Proposition 1. // < X < (1 - A)P and (e + 5) < A < 1 

™ c 2 k ~ s - 1 
where e = > — r and 6 = n ; , then a = a. 

i=i 



2 fc 



Proof: To measure the inaccuracy induced by the approx- 
imation of ji/pi, we define the error terms 



e > = p- l -¥ iind5 > = ¥ 



li 



We have 



ifc— S 



(3) 



As Ei,S q > 0, 



I Ti I 

2 s ~ ^ ' 

i=l i=l rt 



Since 7^ < pj, an upper bound is given by 

_ 7i / 2 fc -p.A 2 fc - Pt c, 
£l pi I 2 fe J < 2 fc 2 fc 



(4) 



Moreover, as ji 
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Ok—i 



x 2 k ~ s < 2 k ~ s 



1, we obtain 



2 k-s 



X 2 



k — 5 



-yk—s 



< 



2 k 



Eq. |4] and the introduction of the previous inequalities in 
Eq.[5]lead to 



l_2 fe " s J 



<y2_. 
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The substitution of a + — for — and the addition of A 



result in: 



X 



a + — -(e + <5)+A < 



E 



I2 k - S l 



X 
P 



A < a+ — + A. (5) 



As X < (1 - A)P, it follows that: 

I H 
n — : — 

l2 k ~ s 



a < 



i=l 



A < 



Now, let us define Z as 7, |Pj|^ — |aP|^. We can easily 

i=l 

check that Z lives in [0, n2 k £ — 1] and is congruent to X 
modulo I. After determining a, we are able to perform a 
fully RNS computation of Z. Algorithm [3] summarizes the 
steps of computation. We denote by P and T the RNS forms 



(IJT 1 



pi 



p _ 



) and (71, ... , 7„), respectively. 



Algorithm 3: RNS modular reduction 



Input : k, £, n, A, s, B, P precomputed 
Input : precomputed table |P^L for i = 1 . , .n 
Input : precomputed table |aPh for a = 1. . .1 
Input : positive X < (1 - A) P in RNS form 
Output: Z = X (mod £), Z < n2 k £ 

1 r <- x x p 

2 for i = 1 . . . n do 

3 Q <-ji x |Pi|^ 



4 Z 



5 a ■(— 



6 Z 



=1 

n 

E 

i=l 

Z-\aP\ 



[ 2 ks\ 



III RNS multiplication 

III RNS multiplication 
//(n - 1) RNS additions 

//sum of n s-bit terms 
//l RNS subtraction 



All the operations can be evaluated in parallel on the 
residues, except for step 3, where a broadcast of the whole 
vector r is needed. 

Even if the obtained result Z is not the exact reduction of X, 
it is bounded by n2 k £. So, we guarantee that the intermediate 
results of the SpMV computation do not exceed a certain 
bound less than P. Notice that this RNS reduction algorithm 
imposes that P be one modulus (k bits) larger than implied 
by the earlier condition £ < P. 

IV. Sparse Matrix Storage Formats 

A. Compressed Sparse Row (CSR) 

The CSR format stores the column indices and the values 
into two arrays of 71-nz elements: id and data. A third array 
of row pointers, ptr, of length N + 1, is used to indicate 
the beginning and the end of each row. Non-zeros are sorted 
by their row indices. The CSR format eliminates the explicit 
storage of the row index, and is convenient for a direct access 
to the matrix, since ptr indicates where each row starts and 



ends in the other two ordered arrays. In the pseudo-code, the 
vectors src and dst represent v and w, respectively. 

1) Scalar: To parallelize the product for the CSR format, 
a simple way is to assign one thread for each row (scalar 
approach) (cf. listing [TJ. Temporary results are stored on 
registers and the final result is written to global memory. In 
the running sum, the RNS AddMul is performed in PTX. For 
simplicity, it is denoted here by the operators "*", "+" and 
"%". 

The major drawback of the scalar approach is that making 
each thread iterate over all the RNS residues increases the 
number of registers per thread. This approach also suffers from 
a poor global memory load efficiency, because the threads 
within a same warp access in a non-contiguous fashion id 
and data. 

global void 

spmv_csr_scalar_keniel ( const int * data , const int * id , 
const int * ptr , const int * p , 
const uint2 * src , uint2 * dst ) 

{ 

// one thread per row 

int row = BLOCK_SIZE * blockldx.x + threadldx.x; 
if ( row >= nRows ) return; 

uint2 vals[n]; // array of n (the # of RNS moduli) 64— bit words 

for (int j = 0; j < n; ++j) 
vals [j ] =0; // initialization 

// running sum 

for(int i = ptr [row]; i < ptr[row+l]; ++i) 
for (int j = 0; j < n; ++j) 

vals[j] =(vals[j] + data[i] * src[id[i] *n + j]) % p[j]; 

for (int j = 0; j < n; ++j) 
dst [row * n + j] = vals[j]; // from registers to global memory 

} 

Listing 1. SpMV kernel for the scalar CSR format 

2) Vector: The vector approach consists in assigning a 
warp to each row of the matrix. The threads within a warp 
access neighboring non-zeros elements, which makes the warp 
accesses to id and data contiguous. Each thread computes 
its partial result in shared memory, then a parallel reduction in 
shared memory is required to combine the per-thread results. 
No synchronization is needed, since threads belonging to a 
same warp are implicitly synchronized. 

However, in the context of RNS arithmetic, the vector 
scheme still suffers from a low load/write efficiency when 
accessing src and dst, because threads within the same warp 
simultaneously access residues of different entries, which are 
not contiguous. 

3) Residue-vector: To overcome the limitations of the pre- 
vious approaches, we propose to organize the threads within 
a warp into tiqps groups of n threads, where tiqps x n (that 
we denote by ncus) is closest to 32. Each group is associated 
to a non-zero (cf. Listing |2j. For instance for n — 6, we take 
n>GPS = 5, so the first 6 threads process in parallel residues 
of the 1 st non-zero, threads 6 to 11, process the 2 nd non- 
zero, and so on, and we will have two idle threads. Like for 
the vector approach, a reduction is needed to combine the 
results of threads working on the same residue and belonging 
to different groups. 
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In this scheme, which we call residue-vector, the number 
of registers is reduced by eliminating the per-thread loop to 
process all the residues of an entry, and contiguous accesses 
to src and dst are performed. 

global void 

spmv_csr_residue_vector_kernel ( const int * data , const int * id , 

const int * ptr, const int * p, 
const uint2 * src , uint2 * dst ) 

{ 

_shared_ uint2 vals [(BLOCK_SIZE / 32) * NCUS]; 

int tid = BLOCK_SIZE * blockldx.x + threadldx.x; 

int warp_id = tid / 32; 

int lane = tid & 31; 

int resid_id = lane % n; // residue index within the entry 

int vals_id = (threadldx .x / 32) * NCUS + lane; 

int row = warp_id; // one warp per row 

if (row >= nRows) return; 

if (lane >= NCUS) return; //idle threads 

uint2 val =0; 

for(int i = ptr [row] + lane / n; i < ptr[row+l]; i+=NGPS) 
val = (val + data[i] * src[id[i] * n + resid_id ]) % p[resid_id]; 

vals [ vals_id ] = val ; // from registers to shared memory 

if (lane < n) {// first group 
for (int k = 1; k < NGPS; ++k) 

val = (val + vals[vals_id + k * n]) % p[resid_id]; // reduction 
dst [row * n + residue_id] = val; 

} 

} 

Listing 2. SpMV kernel for the residue-vector CSR format 



B. Coordinate (COO) 

For the COO format, the row/column indices and the value 
are explicitly stored to specify a non-zero entry. The format 
consists of three arrays row_id, col_id and data of ?t,nz 
elements. In this work, we propose to sort the entries by their 
row index. 

A typical way to work with the COO format on GPU is 
to assign one thread to each entry. This implies that different 
threads from different warps will process the same row. To 
combine their results, one possibility is to do atomic updates 
on global memory to the result vector dst, which significantly 
decreases the performances. Another possibility is that each 
thread computes its partial dot product, then performs a 
segmented reduction 11231 . 11241 to sum the partial results of 
the other threads belonging to the same warp and spanning the 
same row. We followed the scheme proposed by the library 
cuspS, which performs the segmented reduction in shared 
memory, using the row indices as segment descriptors. As for 
the residue-vector CSR kernel, we assign a group of threads to 
each non-zero. Each warp iterates over its interval, processing 
NGPS elements at a time. If a spanned row is terminated, 
its result is written to dst, otherwise, the row index and 
the partial dot product are stored in temporary arrays. Then, 
a second kernel performs the combination of the per-warp 
results. 

1 http ://code. google, com/p/cusp-library/ 



C. Sliced Coordinate (SLCOO) 

The SLCOO format was inspired from the CADO-NFS@ 
software for CPUs. It was introduced for GPUs by Schmidt et 
al. for integer factorization, in the particular case of matrices 
over GF(2) lfl4l . The aim of this format is to increase the 
cache hit rate that limits the CSR and COO performances. 
Like COO, the SLCOO representation stores the row indices, 
column indices and values. It divides the matrix into horizontal 
slices, where the non-zeros of a slice are sorted according 
to their column indices with the aim to reduce the irregular 
accesses on source vector src, if they had been sorted by 
their row indices. A fourth array ptrSlice indicates the 
beginning and end of each slice. 

For the SLCOO kernel (cf. Listing [3), one warp works on 
a slice. A group of n threads processes in parallel a non-zero, 
each thread being assigned to a residue. Since each thread 
works on more than one row, it needs to have individual 
storage for its partial per-row results, or to be able to have 
exclusive access to a common resource. In |[T4l . where a thread 
block had been assigned to each slice, three possibilities have 
been mentioned to solve this issue: 

• Small SLCOO: each thread has one exclusive entry in 
shared memory to store the partial result for each row. 

• Medium SLCOO: threads having the same lane (index 
within the warp) share one entry per row in shared 
memory and access it by an atomic XOR operation. 

• Large SLCOO: all threads share one entry per row. 
The Medium SLCOO allows one to put (BLOCK_SIZE/32) 
times as many rows on one slice than the Small SLCOO. 
Similarly, the Large SLCOO allows one to fill 32 times as 
many rows than the Medium SLCOO. This way, increasing 
the number of rows per slice (from Small to Large variants) 
increases the texture cache hit rate, which compensates the 
drawbacks of the atomic accesses. 

However, because of the reduction, it is not possible to 
perform an atomic addition in RNS. For this reason, we only 
implement the Small SLCOO. Furthermore, this kernel suffers 
from the excessive usage of shared memory, since it stores 
Z/£Z entries. 

global void 

spmv_slcoo_kernel( const int * data, const int * row_id, 
const int * col_id , const int * ptrSlice , 

const int * p, const uint2* src, 
uint2* dst) 

{ 

_shared_ uint2 vals [SLICE_LENG * (BLOCK_SIZE / 32) * NCUS]; 

int tid = BLOCK_SIZE * blockldx.x + threadldx.x; 
int warp_id = tid / 32; 
int lane = tid & 31; 
int resid_id = lane % n; 

int vals_id = (threadldx.x / 32) * NCUS + lane; 

if (lane >= NGPS * n) return; 

for (int j = 1; j < SLICE_LENG; ++j) 
vals [j * (BLOCK_SIZE / 32) * NCUS + vals_id] = 0; 

for(int i = ptrSlice [warp_id] + lane / n; i < ptrSlice [warp_id + 1]; 
i+= NGPS) 

2 http://cado-nfs. gforge.inria.fr/ 
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vals [(row_id[i] % SLICE_LENG) * (BLOCK_SIZE / 32) * NCUS + 
vals_id] += datafi] * src[col_id[i] * n + resid_id] % p[resid_id]; 

if (lane < n) { 

// reduction in shared memory 
for (int j = 1; j < SLICE_LENG; ++j) 
for (int k = 1; k < NGPS; ++k) 

vals [j * (BLOCK_SIZE / 32) * NCUS + vals_id] += valsfj * ( 
BLOCK_SIZE / 32) * NCUS + vals_id + k * n] % p[ 
resid_id] ; 

// from shared memory to global memory 
for (int j = 1; j < SLICE_LENG; ++j) 
dst [(warp_id * SLICE_LENG + j) * n + resid_id] = vals[j * ( 
BLOCK_SIZE / 32) * NCUS + vals_id]; 

} 

} 

Listing 3. SpMV kernel for the SLCOO format 

There are in the literature other SpMV formats that we 
did not consider such as DIA (Diagonal), ELL (ELLPACK), 
etc. The DIA stores the offset of each sub-diagonal from 
the main diagonal and is appropriate only for matrices that 
satisfy some sparsity patterns, which is not our case. The ELL 
format extends the CSR arrays to N-by-K arrays, where K 
corresponds to the maximum number of non-zeros per row. 
For the matrices that we use, this storage scheme provide no 
advantage over the CSR scheme. 

V. Comparative Analysis of SpMV Kernels 

All the experiments are run on a system comprising an 
NVIDIA GeForce GTX 580 and an Intel Xeon Processor 
E5440 (2.83 GHz). Each SpMV kernel is executed 100 times. 
We report the computational throughput in terms of GFLOP/s, 
which we determine by dividing the number of required 
operations (twice the number of non-zeros multiplied by 2 x n) 
by the running time. The memory throughput is computed 
by dividing the total number of bytes read/written by all the 
threads by the running time. Our measurements do not include 
the time spent to copy data between the host and the GPU, 
since the matrix and vectors do not need to be transferred 
back and forth in each SpMV iteration. These delays are 
thus dwarfed by the computation latencies. The timings do 
not include the reduction modulo I, since it happens only 
once every few iterations and also because our implementation 
based on multi-precision representation does not support it. 

The reported measurements are based on the Nvidia Visual 
Profile^ and the NVIDIA CUDA Occupancy Calculatoi@ 
results. Table U summarizes the test matrix over Z/£Z, where 
I is a prime of up to 280 bits. The matrix was obtained from 
a DLP resolution program based on the FFS algorithm. The 
Z/iZ entries fit in five 64-bit residues. There is an extra 64-bit 
residue, needed for the modular reduction: n = 6. 



A. comparison of CSR variants 

The scalar kernel uses a large number of registers, which 
limits the maximum number of warps that can be put on an SM 
to 24 active warps (per SM), out of the 48 supported. This is 
reported on the theoretical occupancy column in Table [TT] On 

3 http://developer.nvidia.com/nvidia- visual-profiler 
4 http://developer.nvidia.com/ cuda-downloads 





FFS-280 


Size 

#Non-zeros 
Max (row norm) 
Percentage of ±1 
Size of £ (bits) 
Size of M (bits) 
Size of n2 k £ (bits) 


1,732,788 X 1,732,788 
86,639,540 
374 
93.48% 
280 
384 
346 



TABLE I 

Properties of used matrix 



the other hand, the vector kernel suffers from shared memory 
usage (4.5kB for a 96-thread block), which also limits the 
maximal reachable occupancy to 50%. For these two kernels, 
the low occupancy significantly decreases the performances. 
Compared to the vector kernel, the scalar one reaches higher 
throughput, since it allows an Ll-oriented configuration (48k- 
Ll, 16k-shared) of the on-chip memory, while for the vector 
kernel, a shared-oriented configuration (16k-Ll, 48k-shared) 
is required. 

Concerning the global memory access pattern, the column 
Global Load/Store Efficiency gives the ratio of requested 
memory transactions to the number of transactions performed, 
which reflects whether accesses are all perfectly coalesced 
(100% efficiency) or not. For scalar and vector kernels, uncoa- 
lesced accesses cause the bandwidth loss and the performance 
degradation. 

The residue-vector (CSR-RV) kernel satisfies better the 
GPU architectural characteristics. It makes the write accesses 
coalesced (100% store efficiency). For the loads, we obtain 
only 45% efficiency, because a warp accesses several non- 
zeros, which are not necessarily contiguous (sparsity of A). 

B. COO/CSR-RV comparison 

Due to the segmented reduction, the COO kernel performs 
more instructions and requires more registers than the CSR 
kernel. Thread divergence happens more often, because of 
the several branches that threads belonging to the same warp 
can take. Memory access pattern for writing is less efficient, 
compared to the CSR kernel (cf. Store Efficiency in Table PI . 
due to the fact that some results are written to dst, the rest to 
the temporary vector. Consequently, the COO kernel reaches 
lower throughput than the CSR one. 

C. SLCOO/CSR-RV comparison 

Table PI] compares several SLCOO kernels for different 
slice lengths. By making the slices larger, we increase the 
usage of shared memory proportionally to the slice length. 
For slice length above 4, the requested shared memory exceeds 
16kB, two options are possible: 

> Decrease the number of blocks on a multiprocessor and 
keep the 48k-Ll-16k-shared configuration (cf. kernels 
marked t in Table PIT). This improves the cache hit rate. 
• Keep the same occupancy and switch to 16k-Ll-48k- 
shared configuration. This limits the cache hit rate. 
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Registers 


Shared Mem. 


Branch 


(Theoretical) 


Global Load / 


Timing 


Global Load / 


Computation. 




per Thread 


per SM 


Divergence 


Occupancy 


Store Efficiency 


in ms 


Store Throughputs 


Throughput 


CSR-S 


37 





48.2% 


(50%) 45.9% 


4.2% / 12.3% 


187.23 


34.95 / 0.42 


11.1 


CSR-V 


28 


36,864 


59.8% 


(50%) 36.4% 


7.5% / 25% 


216.48 


48.79 / 0.36 


9.6 


CSR-RV 


16 


11,520 


25.9% 


(100%) 69.6% 


46.9% / 100% 


59.36 


179.4 / 6.52 


35.02 


COO 


23 


13,824 


36% 


(66.7%) 64.9% 


44% / 37.5% 


110.47 


88.36 / 0.72 


18.82 



TABLE II 

COMPARISON OF scalar, vector AND residue-vector CSR-KERNELS AND COO KERNEL. 



For slices of length 2, the 48k-Ll-16k-Shared configuration is 
sufficient, but we can not reach higher occupancy, due to the 
high number of registers per thread (24). 
For slices of length 12, only the 16k-Ll-48k-Shared configu- 
ration is possible, since each block exceeds the threshold of 
16kB. 

We remark that for both configurations, increasing the slice 
length improves the cache hit rate, since accesses on the 
source vector are less irregular. However, the limitation of the 
occupancy yields poor performance compared to the CSR-RV 
kernel. 

D. RNS and Multi-precision arithmetics comparison 

The idea behind the use of RNS arithmetic rather than 
multi-precision (MP) arithmetic is that RNS can significantly 
decrease data sharing between the threads and arithmetic 
operations required for the carry generation/propagation (cf. 
Table IIVb . Consequently, it allows us to reach higher occu- 
pancy and better performance. 

The low difference between the two kernels (< 5%) is due 
to the fact that the RNS kernel requires an extra residue for 
the modular reduction step. If that residue is removed, the 
speed-up of RNS compared MP is around 15%. 

To perform the modular reduction in RNS, we choose 
A = 0.25. For the maximum row norm that we have (374), 
this allows to do up to 4 iterative SpMVs before having to 
reduce. The reduction kernel takes 5.7 ms, which corresponds 
to 1.42 ms per iteration (~ 2.5% of the SpMV delay). 

VI. Improvements on CSR kernel 

To improve the kernel performance, one should take into 
account the GPU architectural characteristics: the management 
of the memory accesses and the partitioning of the com- 
putations. The effects and the results corresponding to each 
improvement are reported in Table [V] 

A. Texture caching 

Although our SpMV kernel suffers from irregular load 
accesses, a thread is likely to read from an address near the 
addresses that nearby threads (of the same group) read. For this 
reason, we enable the texture caching by binding the source 
vector on texture memory and by replacing reads of the form 
src [ j ] with a texture fetch texlDf etch ( src_tex, j ) . 
This improves the global memory efficiency and consequently 
the throughput. 



B. Reordering the non-zeros of a row 

When computing discrete logarithms, most of the coef- 
ficients of the matrix are ±1. It seems promising to treat 
multiplications by these coefficients differently from other 
coefficients: additions and subtractions are less expensive than 
multiplications. Moreover, we are not able to produce the same 
code for positive and negative values, so negative coefficients 
are processed differently as well. All these separations result in 
code divergence, that we fix by reordering the non-zeros in the 
matrix such that values of the same subgroup are contiguous. 
This decreases the branch divergence and increases the total 
throughput. 

C. Improving warp balancing 

In the proposed kernel, each warp processes a single row. 
This requires launching a large number of warps. Conse- 
quently, there is a delay to schedule those launched warps. 
Instead, we propose that each warp iterates over a certain 
number of rows. Although this increases the usage of registers 
(18 — > 24), it improves the achieved occupancy. To further 
increase the occupancy, we permute the rows such that each 
warp roughly gets the same work load. 

VII. Conclusion 

We have investigated different data structures to perform 
iterative SpMV for DLP matrices on GPUs. We have adapted 
the kernels for the context of large finite fields and added op- 
timizations suitable to the sparsity and the specific computing 
model. The CSR format based on the residue-vector approach 
appears to be the most efficient one. The SLCOO poses for 
the sizes that we use some hardware difficulties that nullify 
its contribution on increasing the cache hit rate. Future GPUs 
may enhance the performances. We have shown that using 
RNS for finite field arithmetic provides a considerable degree 
of independence, which can be exploited by massively parallel 
hardware. 
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Shared Memory 
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Timing 


Global Load / 


Computational 
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Hit Rate 
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Store Throughputs 
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4f 
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2 


f33 3%") 29 8% 
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122 53 / 99 
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4 
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4 


(66.7%) 61.1% 
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8t 
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1 
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8 
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3 
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12 
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2 
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48.8% 
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6 
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TABLE III 
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Registers 


Shared Memory 


Executed 


(Theoretical) 


Timing 


Computational 




per Thread 


per Block 


Instructions 


Occupancy 


in ms 


Throughput 


MP 


20 


2880 


8.05 10 8 
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62.07 


33.5 


RNS 


16 


1920 


7.73 10 8 


(100%) 69.6% 


59.36 


35.02 



TABLE IV 

Comparison of RNS and MP kernels for the CSR-RV format. 





Performance Effects 


Timing in ms 


Throughput (speedup) 


Texture caching 


Global Load Efficiency: 46.9% -> 84.7% 


44.33 


46.9 (+33%) 


Non-zeros reordering 


Branch Divergence: 25.9% 12.9% 


41.66 


49.91 (+6.4%) 


Multiple iterations 


(Theoretical) Occupancy: (100%) 69.6% -> (83.3%) 74.8% 


41.5 


50.1 (+0.4%) 


Rows permutation 


(Theoretical) Occupancy: (83.3%) 74.8% -> (83.3%) 82.5% 


41.04 


50.66 (+1.1%) 



TABLE V 

Performance Effects of the improvements on the CSR-RV kernel and their speedups. 
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